zero_one_loss (classification error / 1 - accuracy)#

zero_one_loss is the simplest classification loss: it counts how often the predicted label differs from the true label.

It is a great evaluation metric for “did we get the label right?”, but a poor training objective for gradient-based optimization because it is discontinuous / non-differentiable.

Learning goals#

  • write the binary and multiclass definitions in clean notation

  • understand the link to accuracy and (for binary) the confusion matrix

  • implement zero_one_loss in NumPy (with optional sample_weight)

  • build intuition via threshold and parameter-surface plots (Plotly)

  • see how 0-1 loss is used for selection/optimization in practice (threshold tuning)

Quick import#

from sklearn.metrics import zero_one_loss

Table of contents#

  1. Definition and notation

  2. Intuition: thresholds and decision rules (plots)

  3. NumPy implementation + sanity checks

  4. Using 0-1 loss for selection/optimization

  5. Pros, cons, pitfalls

References (quick)#

  • scikit-learn docs: https://scikit-learn.org/stable/api/sklearn.metrics.html

  • ESL (Hastie, Tibshirani, Friedman): “The Elements of Statistical Learning” (classification + empirical risk)

import numpy as np

import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, zero_one_loss as sk_zero_one_loss
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)

1) Definition and notation#

Assume we have \(n\) examples.

  • True label: \(y_i\)

  • Predicted label: \(\hat{y}_i\)

Per-example 0-1 loss#

\[ \ell_i = \mathbb{1}[\hat{y}_i \ne y_i] \]

Aggregate (count vs mean)#

Unnormalized (count of mistakes):

\[ L_{\text{count}} = \sum_{i=1}^n \mathbb{1}[\hat{y}_i \ne y_i] \]

Normalized (fraction of mistakes):

\[ L = \frac{1}{n} \sum_{i=1}^n \mathbb{1}[\hat{y}_i \ne y_i] \]

This is exactly:

\[ L = 1 - \text{accuracy}. \]

Sample-weighted version#

Given weights \(w_i \ge 0\) (e.g. importance weights, class weights), the normalized weighted 0-1 loss is:

\[ L_w = \frac{\sum_{i=1}^n w_i\,\mathbb{1}[\hat{y}_i \ne y_i]}{\sum_{i=1}^n w_i}. \]

Multiclass and multilabel#

  • Multiclass (\(K\) classes): \(y_i \in \{0,\dots,K-1\}\) and the same formula applies.

  • Multilabel / multioutput: \(y_i\) is a vector. scikit-learn’s zero_one_loss uses subset 0-1 loss:

\[ \ell_i = \mathbb{1}[\hat{\mathbf{y}}_i \ne \mathbf{y}_i] \]

i.e. the whole label vector must match exactly. (This is often stricter than what you want; see pitfalls.)
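To see how strict subset 0-1 is, here is a quick sketch (toy labels, assumed here) contrasting it with Hamming loss, which scores each label position individually:

```python
import numpy as np
from sklearn.metrics import zero_one_loss, hamming_loss

y_true = np.array([[1, 0, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [1, 1, 0]])  # row 0 has exactly one wrong label

# Subset 0-1: row 0 counts as entirely wrong -> 1 bad row out of 2 -> 0.5
subset = zero_one_loss(y_true, y_pred)
# Hamming: 1 wrong label out of 6 -> 1/6
hamming = hamming_loss(y_true, y_pred)
print(subset, hamming)
```

One nearly-correct row costs as much under subset 0-1 as a row that is wrong everywhere.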

Bayes optimal decision rule (why argmax probability is optimal)#

Let the model output class probabilities \(p_k(x) = P(Y=k\mid X=x)\). The classifier that minimizes the expected 0-1 loss is:

\[ \hat{y}(x) = \arg\max_k\ p_k(x). \]

Binary case with \(\eta(x)=P(Y=1\mid X=x)\) and equal misclassification costs:

\[ \hat{y}(x)=\mathbb{1}[\eta(x)\ge 1/2]. \]

With misclassification costs \(c_{01}\) (cost of a false positive) and \(c_{10}\) (cost of a false negative), the optimal threshold becomes:

\[ \hat{y}(x)=\mathbb{1}\Big[\eta(x)\ge \frac{c_{01}}{c_{01}+c_{10}}\Big]. \]
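A quick numerical check of this rule, using illustrative costs \(c_{01}=1\), \(c_{10}=4\) (values assumed for the example): predicting 1 is cheaper in expectation exactly when \(\eta\) clears the cost ratio.

```python
import numpy as np

c01, c10 = 1.0, 4.0            # illustrative costs: false positive, false negative
t_star = c01 / (c01 + c10)     # optimal threshold = 0.2

eta = np.array([0.05, 0.15, 0.25, 0.50, 0.90])   # sample posterior values
cost_pred_1 = (1 - eta) * c01                    # expected cost of predicting 1
cost_pred_0 = eta * c10                          # expected cost of predicting 0

# Predicting 1 is strictly cheaper in expectation exactly when eta >= t_star
print(t_star)                                            # 0.2
print((cost_pred_1 < cost_pred_0) == (eta >= t_star))    # all True
```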
def sigmoid(z):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))


def zero_one_loss_np(y_true, y_pred, *, normalize=True, sample_weight=None):
    """NumPy implementation of scikit-learn's zero_one_loss.

    - If y is 1D: counts elementwise mismatches.
    - If y is 2D (multilabel / multioutput): uses subset 0-1 loss (row must match exactly).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}")
    if y_true.ndim not in (1, 2):
        raise ValueError("y_true must be 1D or 2D")

    if y_true.ndim == 1:
        incorrect = (y_true != y_pred)
    else:
        incorrect = np.any(y_true != y_pred, axis=1)

    incorrect = incorrect.astype(float)
    n = incorrect.shape[0]

    if sample_weight is None:
        total = float(incorrect.sum())
        return total / n if normalize else total

    w = np.asarray(sample_weight, dtype=float)
    if w.ndim != 1 or w.shape[0] != n:
        raise ValueError(f"sample_weight must be shape (n,), got {w.shape}")

    total = float(np.sum(w * incorrect))
    if not normalize:
        return total

    w_sum = float(w.sum())
    if w_sum == 0:
        return 0.0
    return total / w_sum


def predict_labels_from_proba(p, *, threshold=0.5):
    """Convert probabilities to hard labels.

    - Binary: p is (n,) or (n,2) (assumes column 1 is P(y=1)).
    - Multiclass: p is (n,K) -> argmax.
    """
    p = np.asarray(p, dtype=float)
    if p.ndim == 1:
        return (p >= threshold).astype(int)
    if p.ndim == 2 and p.shape[1] == 2:
        return (p[:, 1] >= threshold).astype(int)
    if p.ndim == 2:
        return np.argmax(p, axis=1)
    raise ValueError(f"p must be 1D or 2D, got shape {p.shape}")


def zero_one_loss_from_proba(
    y_true,
    p,
    *,
    threshold=0.5,
    normalize=True,
    sample_weight=None,
):
    y_pred = predict_labels_from_proba(p, threshold=threshold)
    return zero_one_loss_np(y_true, y_pred, normalize=normalize, sample_weight=sample_weight)


def log_loss_binary(y_true, p, *, sample_weight=None, eps=1e-15):
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1 - eps)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    if sample_weight is None:
        return float(per_sample.mean())
    w = np.asarray(sample_weight, dtype=float)
    w_sum = float(w.sum())
    if w_sum == 0:
        return 0.0
    return float(np.sum(w * per_sample) / w_sum)


def best_threshold_zero_one(y_true, p, *, sample_weight=None, normalize=True):
    """Find an exact minimizer over thresholds t in [0, 1] (binary, rule: p>=t -> 1).

    The predictions only change when t crosses a value in p, so evaluating t over unique p values
    (plus the endpoints 0 and 1) is enough to find the exact optimum.
    """
    y_true = np.asarray(y_true)
    p = np.asarray(p, dtype=float)
    if y_true.shape != p.shape or p.ndim != 1:
        raise ValueError("y_true and p must be 1D arrays of the same shape")

    if sample_weight is None:
        w = np.ones_like(p, dtype=float)
    else:
        w = np.asarray(sample_weight, dtype=float)
        if w.shape != p.shape:
            raise ValueError("sample_weight must have the same shape as p")

    order = np.argsort(p)
    p_s = p[order]
    y_s = y_true[order]
    w_s = w[order]

    w_pos = w_s * (y_s == 1)
    w_neg = w_s * (y_s == 0)
    cum_pos = np.cumsum(w_pos)
    cum_neg = np.cumsum(w_neg)
    total_neg = float(cum_neg[-1])

    uniq = np.unique(p_s)
    thresholds = np.unique(np.concatenate(([0.0], uniq, [1.0])))
    start = np.searchsorted(p_s, thresholds, side="left")
    before = start - 1
    pos_below = np.where(before >= 0, cum_pos[before], 0.0)
    neg_below = np.where(before >= 0, cum_neg[before], 0.0)

    misclassified = pos_below + (total_neg - neg_below)
    if normalize:
        denom = float(w_s.sum())
        losses = misclassified / denom if denom > 0 else np.zeros_like(misclassified)
    else:
        losses = misclassified

    best_j = int(np.argmin(losses))
    return float(thresholds[best_j]), float(losses[best_j])


def standardize_fit_transform(X):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X, mean, std):
    X = np.asarray(X, dtype=float)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


def fit_logistic_regression_gd(
    X_train,
    y_train,
    X_val=None,
    y_val=None,
    *,
    lr=0.2,
    n_steps=300,
    l2=0.0,
    threshold=0.5,
):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)

    n, d = X_train.shape
    w = np.zeros(d, dtype=float)
    b = 0.0

    hist = {
        "step": [],
        "train_log_loss": [],
        "train_zero_one": [],
        "val_log_loss": [],
        "val_zero_one": [],
    }

    for step in range(n_steps + 1):
        z_train = X_train @ w + b
        p_train = sigmoid(z_train)

        hist["step"].append(step)
        hist["train_log_loss"].append(log_loss_binary(y_train, p_train))
        hist["train_zero_one"].append(zero_one_loss_from_proba(y_train, p_train, threshold=threshold))

        if X_val is not None and y_val is not None:
            z_val = np.asarray(X_val, dtype=float) @ w + b
            p_val = sigmoid(z_val)
            hist["val_log_loss"].append(log_loss_binary(y_val, p_val))
            hist["val_zero_one"].append(zero_one_loss_from_proba(y_val, p_val, threshold=threshold))
        else:
            hist["val_log_loss"].append(np.nan)
            hist["val_zero_one"].append(np.nan)

        if step == n_steps:
            break

        # gradient of mean log loss (plus optional L2 penalty)
        grad = p_train - y_train
        grad_w = (X_train.T @ grad) / n + l2 * w
        grad_b = float(grad.mean())

        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, hist

2) Intuition: thresholds and decision rules (plots)#

0-1 loss depends only on hard labels.

In binary classification, many models output a score or probability \(\hat{p}(y=1\mid x)\). To turn that into a label we pick a threshold \(t\):

\[ \hat{y}(t) = \mathbb{1}[\hat{p} \ge t]. \]

As you vary \(t\), the predictions only change when \(t\) crosses one of the predicted probabilities. So the empirical 0-1 loss as a function of \(t\) is a step function (flat most of the time, then jumps).

This is a key reason 0-1 loss is not used as a smooth training objective: small parameter changes often produce no change in 0-1 loss until a point flips sides.
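A tiny numeric illustration (toy probabilities, assumed here): moving the threshold anywhere between two adjacent predicted probabilities changes nothing, and the loss only jumps when a probability is crossed.

```python
import numpy as np

p_hat_toy = np.array([0.2, 0.4, 0.6, 0.9])
y_toy = np.array([0, 0, 1, 1])

def loss_at(t):
    return float(np.mean((p_hat_toy >= t).astype(int) != y_toy))

# No predicted probability lies in (0.4, 0.6], so every threshold there agrees
print(loss_at(0.45), loss_at(0.50), loss_at(0.60))   # 0.0 0.0 0.0
# Crossing 0.6 flips one example and the loss jumps
print(loss_at(0.61))                                  # 0.25
```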

n = 250
x = rng.normal(size=n)
p_true = sigmoid(1.5 * x - 0.3)
y = rng.binomial(1, p_true)

# Pretend these are predicted probabilities from an imperfect model
p_hat = np.clip(p_true + 0.15 * rng.normal(size=n), 1e-3, 1 - 1e-3)

thresholds = np.linspace(0.0, 1.0, 601)
losses = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
acc = 1.0 - losses

t_best, _ = best_threshold_zero_one(y, p_hat)

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(
        x=thresholds,
        y=losses,
        name="zero-one loss",
        mode="lines",
        line_shape="hv",
    ),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(
        x=thresholds,
        y=acc,
        name="accuracy (1 - loss)",
        mode="lines",
        line_shape="hv",
    ),
    secondary_y=True,
)

fig.add_vline(x=0.5, line_dash="dash", line_color="gray", opacity=0.7)
fig.add_vline(x=t_best, line_dash="dot", line_color="crimson")

fig.update_xaxes(title_text="threshold t")
fig.update_yaxes(title_text="0-1 loss", secondary_y=False, range=[0, 1])
fig.update_yaxes(title_text="accuracy", secondary_y=True, range=[0, 1])

fig.update_layout(
    title=f"0-1 loss is a step function of the threshold (one optimal t ≈ {t_best:.3f})",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
)
fig.show()

3) NumPy implementation + sanity checks#

A key property: 0-1 loss is insensitive to confidence.

  • predicting 0.51 vs 0.99 for the positive class gives the same 0-1 outcome (as long as the thresholded label is the same)

  • but a probabilistic loss like log loss will strongly prefer 0.99 over 0.51 when the true label is 1

Let’s verify that our NumPy version matches scikit-learn and highlight the “confidence blindness”.

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

print("numpy (mean):", zero_one_loss_np(y_true, y_pred))
print("sklearn (mean):", sk_zero_one_loss(y_true, y_pred))
print("1 - accuracy_score:", 1 - accuracy_score(y_true, y_pred))
print("numpy (count):", zero_one_loss_np(y_true, y_pred, normalize=False))
print("sklearn (count):", sk_zero_one_loss(y_true, y_pred, normalize=False))

w = np.array([1, 1, 5, 1, 1, 1], dtype=float)
print("\nweighted numpy (mean):", zero_one_loss_np(y_true, y_pred, sample_weight=w))
print("weighted sklearn (mean):", sk_zero_one_loss(y_true, y_pred, sample_weight=w))

# multilabel / multioutput: subset 0-1 loss (row must match exactly)
y_true_ml = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
y_pred_ml = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])
print("\nmultilabel numpy:", zero_one_loss_np(y_true_ml, y_pred_ml))
print("multilabel sklearn:", sk_zero_one_loss(y_true_ml, y_pred_ml))

# confidence blindness: same hard predictions, different probabilities
y_true = np.array([1, 1, 1, 0, 0])
p_soft = np.array([0.51, 0.55, 0.52, 0.49, 0.45])
p_confident = np.array([0.99, 0.90, 0.80, 0.20, 0.01])

print("\n0-1 loss (soft):", zero_one_loss_from_proba(y_true, p_soft))
print("0-1 loss (confident):", zero_one_loss_from_proba(y_true, p_confident))
print("log loss (soft):", log_loss_binary(y_true, p_soft))
print("log loss (confident):", log_loss_binary(y_true, p_confident))
numpy (mean): 0.3333333333333333
sklearn (mean): 0.33333333333333337
1 - accuracy_score: 0.33333333333333337
numpy (count): 2.0
sklearn (count): 2.0

weighted numpy (mean): 0.6
weighted sklearn (mean): 0.6

multilabel numpy: 0.6666666666666666
multilabel sklearn: 0.6666666666666667

0-1 loss (soft): 0.0
0-1 loss (confident): 0.0
log loss (soft): 0.6392579150890872
log loss (confident): 0.11434965799864971

4) Using 0-1 loss for selection/optimization#

Because 0-1 loss is a step function in the threshold (and piecewise constant in the model parameters), it is typically used as a selection criterion rather than a differentiable training objective.

A very common and practical “optimization” task is threshold tuning:

\[ t^* \in \arg\min_{t\in[0,1]}\ L\bigl(y,\ \mathbb{1}[\hat{p}\ge t]\bigr). \]

This works well because it is a 1D search (grid search or exact search over unique probabilities).

If you care more about one class (asymmetric costs), you can encode that with sample_weight (or with an explicit cost-sensitive threshold rule).

# Grid search threshold (approximate)
thresholds = np.linspace(0.0, 1.0, 2001)
losses_grid = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
min_loss_grid = float(losses_grid.min())
min_idx = np.where(losses_grid == min_loss_grid)[0]
t_grid_low = float(thresholds[int(min_idx[0])])
t_grid_high = float(thresholds[int(min_idx[-1])])

# Exact threshold search (evaluate unique p_hat values)
t_exact, loss_exact = best_threshold_zero_one(y, p_hat)

print(f"grid-search min loss: {min_loss_grid:.4f} (t in [{t_grid_low:.4f}, {t_grid_high:.4f}])")
print(f"exact-search min loss: {loss_exact:.4f} (one optimal t={t_exact:.4f})")

# Weighted: make mistakes on positives 3x more costly
w_pos = np.where(y == 1, 3.0, 1.0)
t_w, loss_w = best_threshold_zero_one(y, p_hat, sample_weight=w_pos)
print(f"weighted best t: {t_w:.4f} (loss={loss_w:.4f})")

losses_unweighted = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
losses_weighted = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t, sample_weight=w_pos) for t in thresholds])

fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=losses_unweighted, mode="lines", line_shape="hv", name="unweighted"))
fig.add_trace(go.Scatter(x=thresholds, y=losses_weighted, mode="lines", line_shape="hv", name="weighted (pos×3)"))
fig.add_vline(x=t_exact, line_dash="dot", line_color="black")
fig.add_vline(x=t_w, line_dash="dot", line_color="crimson")
fig.update_layout(title="Threshold tuning for 0-1 loss (unweighted vs weighted)")
fig.update_xaxes(title_text="threshold t")
fig.update_yaxes(title_text="0-1 loss", range=[0, 1])
fig.show()
grid-search min loss: 0.2840 (t in [0.4675, 0.6995])
exact-search min loss: 0.2800 (one optimal t=0.6984)
weighted best t: 0.1358 (loss=0.2457)

4.1 Why 0-1 loss is hard to optimize directly (and what we do instead)#

If a classifier depends on parameters \(\theta\) (e.g. linear model weights), the empirical 0-1 loss is:

\[ L(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[\hat{y}(x_i;\theta) \ne y_i]. \]

This function is:

  • discontinuous / non-differentiable (jumps when a point flips sides)

  • typically non-convex and full of plateaus

  • hard to minimize exactly for most hypothesis classes

So in practice we train with a surrogate loss that is smooth and easier to optimize (e.g. log loss / cross-entropy for logistic regression), and then evaluate with 0-1 loss.
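One way to see why surrogates help: in the margin view (labels in \(\{-1,+1\}\), score \(s\), margin \(m = ys\)), both the hinge loss and a rescaled log loss sit above the 0-1 loss everywhere, so driving the surrogate down also pushes the error rate down. A quick numerical check (the division by \(\log 2\) is just to make the log-loss bound tight at \(m=0\)):

```python
import numpy as np

margins = np.linspace(-3, 3, 601)            # m = y * s with y in {-1, +1}

zero_one = (margins <= 0).astype(float)      # 0-1 loss in margin form
hinge = np.maximum(0.0, 1.0 - margins)       # hinge surrogate
logistic = np.log1p(np.exp(-margins)) / np.log(2)  # log loss, rescaled to 1 at m = 0

# Both surrogates upper-bound the 0-1 loss at every margin
print(np.all(hinge >= zero_one), np.all(logistic >= zero_one))   # True True
```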

The plots below compare the loss landscapes for a simple 1D logistic model.

n = 120
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()

p_true = sigmoid(2.0 * x - 0.4)
y = rng.binomial(1, p_true)

w_grid = np.linspace(-6, 6, 151)
b_grid = np.linspace(-6, 6, 151)

Z = x[:, None, None] * w_grid[None, None, :] + b_grid[None, :, None]
P = sigmoid(Z)

y_pred = (P >= 0.5).astype(int)
loss01 = (y[:, None, None] != y_pred).mean(axis=0)

eps = 1e-12
P_clip = np.clip(P, eps, 1 - eps)
losslog = -(y[:, None, None] * np.log(P_clip) + (1 - y[:, None, None]) * np.log(1 - P_clip)).mean(axis=0)

# Gradient descent on log loss (same simple 1D model)
w = 0.0
b = 0.0
lr = 0.8
w_path = [w]
b_path = [b]
for _ in range(40):
    z = w * x + b
    p = sigmoid(z)
    grad = p - y
    grad_w = float(np.mean(grad * x))
    grad_b = float(np.mean(grad))
    w -= lr * grad_w
    b -= lr * grad_b
    w_path.append(w)
    b_path.append(b)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("0-1 loss (threshold=0.5)", "log loss (smooth surrogate)"),
    horizontal_spacing=0.12,
)

fig.add_trace(
    go.Heatmap(x=w_grid, y=b_grid, z=loss01, zmin=0, zmax=1, colorbar=dict(title="0-1")),
    row=1,
    col=1,
)
fig.add_trace(
    go.Heatmap(x=w_grid, y=b_grid, z=losslog, colorbar=dict(title="log")),
    row=1,
    col=2,
)

fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers", name="GD path"), row=1, col=1)
fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers", showlegend=False), row=1, col=2)

fig.update_xaxes(title_text="w", row=1, col=1)
fig.update_xaxes(title_text="w", row=1, col=2)
fig.update_yaxes(title_text="b", row=1, col=1)
fig.update_yaxes(title_text="b", row=1, col=2)
fig.update_layout(title="0-1 loss is piecewise-constant; log loss provides a smooth optimization landscape")
fig.show()

4.2 Example: train logistic regression (from scratch), evaluate 0-1 loss#

We’ll fit a simple logistic regression model by minimizing log loss with gradient descent, while tracking 0-1 loss on train/validation.

Model:

\[ \hat{p}_i = \sigma(x_i^\top w + b),\qquad \sigma(z)=\frac{1}{1+e^{-z}}. \]

Training objective (mean log loss):

\[ J(w,b) = -\frac{1}{n}\sum_{i=1}^n \Big(y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big). \]

Then we compute 0-1 loss by thresholding \(\hat{p}\) at \(t=0.5\) (and optionally tuning \(t\) on validation).

X, y = make_blobs(
    n_samples=900,
    centers=2,
    n_features=2,
    cluster_std=2.2,
    random_state=0,
)

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0,
    stratify=y,
)

X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)

w, b, hist = fit_logistic_regression_gd(
    X_train_s,
    y_train,
    X_val=X_val_s,
    y_val=y_val,
    lr=0.2,
    n_steps=250,
    l2=0.01,
    threshold=0.5,
)

p_val = sigmoid(X_val_s @ w + b)
val_loss_05 = zero_one_loss_from_proba(y_val, p_val, threshold=0.5)
t_best, val_loss_best = best_threshold_zero_one(y_val, p_val)

print(f"val 0-1 loss @ t=0.5: {val_loss_05:.4f}")
print(f"best val threshold: {t_best:.4f} (val 0-1 loss={val_loss_best:.4f})")

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=hist["step"], y=hist["train_log_loss"], name="train log loss"), secondary_y=False)
fig.add_trace(go.Scatter(x=hist["step"], y=hist["val_log_loss"], name="val log loss"), secondary_y=False)

fig.add_trace(
    go.Scatter(x=hist["step"], y=hist["train_zero_one"], name="train 0-1 loss", line_shape="hv"),
    secondary_y=True,
)
fig.add_trace(
    go.Scatter(x=hist["step"], y=hist["val_zero_one"], name="val 0-1 loss", line_shape="hv"),
    secondary_y=True,
)

fig.update_xaxes(title_text="gradient descent step")
fig.update_yaxes(title_text="log loss", secondary_y=False)
fig.update_yaxes(title_text="0-1 loss", secondary_y=True, range=[0, 1])
fig.update_layout(title="Training with log loss; tracking 0-1 loss (step-like)")
fig.show()

# Decision boundary visualization
x0_min, x0_max = X_train_s[:, 0].min() - 0.8, X_train_s[:, 0].max() + 0.8
x1_min, x1_max = X_train_s[:, 1].min() - 0.8, X_train_s[:, 1].max() + 0.8

x0 = np.linspace(x0_min, x0_max, 220)
x1 = np.linspace(x1_min, x1_max, 220)
xx0, xx1 = np.meshgrid(x0, x1)
grid = np.c_[xx0.ravel(), xx1.ravel()]

prob_grid = sigmoid(grid @ w + b).reshape(xx0.shape)

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=x0,
        y=x1,
        z=prob_grid,
        contours=dict(start=0.0, end=1.0, size=0.1),
        colorscale="RdBu",
        opacity=0.85,
        colorbar=dict(title="P(y=1)"),
        showscale=True,
    )
)

fig.add_trace(
    go.Scatter(
        x=X_train_s[:, 0],
        y=X_train_s[:, 1],
        mode="markers",
        marker=dict(color=y_train, colorscale="Viridis", opacity=0.9, line=dict(width=0.2, color="black")),
        name="train points",
    )
)

fig.update_layout(title="Logistic regression probabilities (0-1 loss comes from thresholding)")
fig.update_xaxes(title_text="x0 (standardized)")
fig.update_yaxes(title_text="x1 (standardized)")
fig.show()
val 0-1 loss @ t=0.5: 0.2074
best val threshold: 0.4675 (val 0-1 loss=0.2000)

5) Pros / cons and when to use 0-1 loss#

Pros#

  • Highly interpretable: “error rate” (or # mistakes)

  • Threshold/decision-rule focused: directly measures what many applications care about (correct label)

  • Works for multiclass with no extra machinery

  • Aligns with the Bayes classifier under equal misclassification costs (argmax posterior)

Cons#

  • Non-differentiable / discontinuous → not suitable as a gradient-based training loss

  • Ignores confidence and calibration: 0.51 and 0.99 are treated the same after thresholding

  • Can be misleading under class imbalance (a majority-class classifier can look good)

  • Depends on the decision rule (threshold choice, argmax ties, cost-sensitive adjustments)

  • Multilabel subset 0-1 is very strict (one wrong label makes the whole example wrong)

When it’s a good choice#

  • Reporting final performance when all errors are equally costly

  • Comparing classifiers after you have a clear, fixed threshold / decision policy

  • Hyperparameter selection when you truly care about accuracy/error rate (using a validation set)

Common pitfalls + diagnostics#

  • Class imbalance: 0-1 loss/accuracy may hide poor minority performance. Also inspect the confusion matrix; consider balanced accuracy, F1, PR AUC.

  • Wrong threshold: if your positive class is rare or costs are asymmetric, \(t=0.5\) is often not optimal; tune \(t\) or use cost-sensitive decision rules.

  • Multilabel strictness: subset 0-1 can be too harsh; consider Hamming loss, Jaccard score, or per-label F1.

  • Probability quality not measured: two models can have the same 0-1 loss but very different calibration; also report log loss / Brier score if probabilities matter.

  • Test-set threshold tuning: choose thresholds/hyperparameters on validation (or via cross-validation), not on the test set.
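To make the imbalance pitfall concrete, a small sketch (synthetic 95/5 split, assumed here): the majority-class predictor scores a low 0-1 loss while missing every positive.

```python
import numpy as np
from sklearn.metrics import zero_one_loss, confusion_matrix

y_true = np.array([0] * 95 + [1] * 5)     # 95% negatives, 5% positives
y_majority = np.zeros(100, dtype=int)     # always predict the majority class

print(zero_one_loss(y_true, y_majority))       # ~0.05: looks excellent
print(confusion_matrix(y_true, y_majority))    # yet all 5 positives are missed
```

A low error rate alone says nothing about minority-class recall; always pair it with the confusion matrix when classes are imbalanced.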

Exercises#

  1. Prove that normalized 0-1 loss is exactly \(1-\text{accuracy}\).

  2. Derive the cost-sensitive threshold \(\eta(x)\ge \frac{c_{01}}{c_{01}+c_{10}}\) from expected cost minimization.

  3. Construct two classifiers with the same 0-1 loss but very different log loss. When would you prefer each?

  4. Extend best_threshold_zero_one to return all thresholds achieving the minimum.

  5. For multilabel data, compare subset 0-1 loss vs Hamming loss on a synthetic example and interpret the difference.

References#

  • scikit-learn zero_one_loss: https://scikit-learn.org/stable/api/generated/sklearn.metrics.zero_one_loss.html

  • scikit-learn accuracy_score: https://scikit-learn.org/stable/api/generated/sklearn.metrics.accuracy_score.html

  • Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, Ch. 2 (classification), Ch. 4 (linear methods for classification)